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Abstract 

There exist two major problems in application of the conventional block 
BiCGSTAB method to the 0(a)-improved Wilson-Dirac equation with mul- 
tiple right-hand-sides: One is the deviation between the true and the recur- 
sive residuals. The other is the convergence failure observed at smaller quark 
masses for enlarged number of the right-hand-sides. The block BiCGGR al- 
gorithm which was recently proposed by the authors succeeds in solving the 
former problem. In this article we show that a preconditioning technique 
allows us to improve the convergence behavior for increasing number of the 
right-hand-sides. 
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1. Introduction 

This paper is the third in a series of publications [H, 0] on a new block 
Krylov subspace method called block BiCGGR. In Ref. [l[ we proposed the 
algorithm which successfully removes the deviation between the true and the 
recursive residuals found in the block BiCGSTAB method. Reference j^j is 
devoted to the application of the new algorithm to solving the 0(a)-improved 
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Wilson-Dirac equations in lattice QCD. Although the significant cost reduc- 
tion is achieved by both the algorithmic efficiency and the cache-aware im- 
plementation technique, there remains one concern that the increase of L, 
which denotes the number of the right-hand sides, makes the convergence of 
the algorithm at lighter quark masses difficult. 

In this paper we investigate the effects of a preconditioning on the conver- 
gence properties of the block BiCGGR method in solving the 0(a)-imp roved 
Wilson-Dirac equations in lattice QCD. For a comparative purpose we em- 
ploy the same gauge configurations as in Ref. 0|. We focus on the lightest 
quark mass used in Ref. [2J, which was the most difficult case to attain the 
convergence with the block BiCGGR method. As a preconditioner we incor- 
porate the inner solver with the Jacobi method. The convergence behavior 
is examined by varying the iteration number j of the Jacobi method. We 
observe that the convergence properties are improved by the preconditioner 
so that the block BiCGGR method retains its efficiency for wider range of 
L. For j > 12 the computational cost with L = 12 is reduced down to 10% 
of that with L = 1 showing stabilized convergence behaviors. 

This paper is organized as follows. In Sec. [2] we explain the algorithmic 
details of the block BiCGGR with the inner solver of the Jacobi method. 
We present the results of the numerical tests in Sec. El Conclusions and 
discussions are summarized in Sec. HI 

2. Preconditioned block BiCGGR 

We consider to solve the linear systems with the multiple right-hand sides 
expressed as 

AX = B, (1) 

where A is an N x N complex sparse non-Hermitian matrix. X and B are 
N x L complex rectangular matrices given by 

X = (xW,...,xW,...,xW) , (2) 
B = {b^\...M\...M L) ). (3) 

In the case of the Wilson-Dirac equation the matrix dimension is given by 
N = x Ly x L z x x 3 x 4 with L r x Ly x L z x L f the volume of 
a hypercubic four-dimensional lattice. L is the number of the right-hand- 
side vectors which is called the source vectors in lattice QCD. Throughout 
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this paper the specific matrix structure of the 0(a)-improved Wilson-Dirac 
equation is not necessary. We refer the readers who may be interested in it 
to Sec. 2 of Ref. Q. 

The details of the block BiCGGR algorithm are presented in Refs. [l], 0|. 
The preconditioned block BiCGGR method with M an N x N precondition- 
ing matrix such that M ~ A^ 1 is described as follows: 

X G C NxL is an initial guess, 
Compute R = B — AX , 
Set Pq = i?o, 
Choose R eC NxL , 
Preconditioning part: F = MR , 
Set V = W = AF , 

For k—0,1,..., until max(||rf || 2 /||& (i) || 2 ) < £ do: 

i 

Solve (R^V k )a k = R^R k for a k , 
( k = Tr[W^R k }/Tr[W k H W k }, 

S k = P k — (kV k , 

U k = S k a k , 
Preconditioning part: G k = MU k , 

Y k = AG k , 
X k+ \ = X k + ( k F k + G k , 

Rk+l = Rk — CkW k — Y k , 

Preconditioning part: F k+ i = MR k+ i, 
W k+1 = AF k+ i, 

Solve (R$R k )>y k = R$R k+1 /( k for 7 fc , 

Pk+l = Rk+l + Uk'-fk, 

Vfc+i = W k+1 + Y" fc 7 fc , 
End for. 

In this paper we employ the Jacobi method as a preconditioner of the block 
BiCGGR method because of its practical parallelizability. In this case the 
matrix G k = MU k in the algorithm is calculated by the Jacobi method as 
follows: 
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3-1 

(/ - A^AyG k , + 52(1- A D ' A) 1 A D 'U k , j>l, 

i=0 



where j, Gk,o, and Ad denote the number of iterations of the Jacobi method, 
the initial guess for the Jacobi method, and the diagonal part of the coeffi- 
cient matrix A, respectively. The preconditioning part of the block BiCGGR 
algorithm is computed by the matrix-vector multiplications. 

The dominant part of the memory requirements, which is proportional 
to N, is given by 16iV(51 + 9L) Bytes without an additional contribution 
from the preconditioner. In a practical sense it would be sufficient that the 
effectiveness of the preconditioner is retained up to L ~ 10, because the 
memory requirements may become a constraint on the applicability of the 
block BiCGGR method once L goes beyond 10. 



3. Numerical tests 

3.1. Choice of parameters 

We employ the same quenched gauge configurations as in Ref . |2[ , which 
are the statistically independent 10 samples generated with the Iwasaki gauge 
action at ft — 2.575 on a L x x L y x L z x L t = 16 x 16 x 16 x 32 lattice. 
We choose one hopping parameter k = 0.1359 for the Wilson-Dirac equation 
with the improvement coefficient csw — 1-345. The bare quark mass is 
defined by m g = (1/k - 1/k c )/2 with n c = 0.136116(8). Note that this 
hopping parameter gives the lightest quark mass in Ref. [sj which was the 
most problematic case to achieve the convergence with the block BiCGGR 
method for the fixed L. According to the results in Ref. [3], the physical pion 
mass is 221 MeV with m n /mp = 0.28 at k = 0.1359. The lattice spacing is 
a = 0.1130 fm determined by m p . 

3.2. Test environment 

Numerical tests are performed on single node of a large-scale cluster sys- 
tem called T2K-Tsukuba which was also employed in the previous study [2j. 
The machine consists of 648 compute nodes providing 95.4Tflops of comput- 
ing capability. Each node consists of quad-socket, 2.3GHz Quad-Core AMD 
Opteron Model 8356 processors whose on-chip cache sizes are 64KBytes/core, 
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512KBytes/core, 2MB/chip for LI, L2, L3, respectively. Each processor has 
a direct connect memory interface to an 8GBytes DDR2-667 memory and 
three hypertransport links to connect other processors. All the nodes in 
the system are connected through a full-bisectional fat-tree network consist- 
ing of four interconnection links of 8GBytes/sec aggregate bandwidth with 
Infiniband. 

3.3. Results 

In Table CD we list the outer iteration number to solve the Wilson-Dirac 
equation with the preconditioned block BiCGGR algorithm as a function of 
L and the inner iteration number j. The initial guess for the block BiCGGR 
method and the Jacobi method is set to the zero matrix. The matrix Rq for 
the block BiCGGR method is chosen as Rq. We employ rather stringent tol- 
erance of maxj(||rj^ l^/Hb^ H2) < 10~ 14 with r% the recursive residual in the 
k-th outer iteration and a unit vector whose i-th. component is unity. The 
results are averaged over 10 configuration samples. In some combinations of 
L and j we find the convergence failure: The residual ceases decreasing and 
starts to increase gradually without reaching the tolerance. In this case we 
give the number of the configuration samples which show the convergence 
failure in Table [H Note that the total number of the matrix- vector multipli- 
cation denoted by ^MVM is given by the formula of 2[(j + l)k + 1]L. This 
should be a more appropriate quantity to be compared. We give #MVM/L 
within the parentheses in each entry of Table [IJ Most important point is 
that we are allowed to achieve the convergence for enlarged L as the inner 
iteration number increases. Secondly, #MVM/L can be reduced with an ap- 
propriate choice of L and j. To illustrate the convergence behaviors we plot 
maXjQIrj^l^/H&^l^) as a function of the outer iteration number k choosing 
one configuration sample as a representative case. Figure [1] shows the L de- 
pendence with j fixed at twelve. We observe a characteristic feature that the 
convergence behaviors for different L are almost identical up to some itera- 
tion number, beyond which the convergence speed for larger L is suddenly 
accelerated. In Fig. [2] we plot the j dependence for the L = 12 case. For 
j = the iteration is terminated when the residual of maxj(||r*^ l^/H&^lh) 
reaches 10 2 without achieving the convergence. It may be surprising that 
both figures show a quite similar feature under the exchange of j and L. 

In Table [2] we present the execution time divided by L as a function of 
L and j. A remarkable cost reduction is observed. The best case is the 
combination of L = 12 and j = 12 where the cost is just 10% of that for the 



5 



unimproved case with L = 1 and j — 0. In a practical use it is reasonable 
to choose L = 12 with j > 12 as default parameters: We observe that the 
stabilized convergence properties with less execution time. If the convergence 
is failed by some possibility, you just repeat the inversion with smaller L. 

There are two key ingredients for this remarkable achievement. One is 
the algorithmic improvements thanks to the block BiCGGR: For the given 
value of j, #MVM/L monotonically decreases as a function of L. The other 
is the efficiency of the cache-aware implementation technique for multiple L. 
The situations are depicted in Fig. [3] with the choice of j = 12. 

Table 1: Outer iteration number as a function of L and the inner iteration number j to 
solve the Wilson-Dirac equation with the preconditioned block BiCGGR method. Results 
are averaged over 10 configuration samples. Fail means how many configuration samples 
out of ten show the convergence failure. #MVM/L is given in the parentheses. 



k = 0.1359 

L j = j = 6 j = 12 j = 18 j = 24 j = 30 



1 


2437.4 
(4875.8) 


536.4 
(7510.6) 


309.4 
(8045.4) 


224.4 
(8528.2) 


176.4 
(8821.0) 


145.8 
(9040.6) 


2 


1713.4 
(3427.8) 


342.0 
(4789.0) 


194.1 
(5047.6) 


140.3 
(5332.4) 


110.0 
(5501.0) 


90.5 
(5612.0) 


4 


Fail: 7/10 


Fail: 1/10 


132.3 
(3440.8) 


92.7 
(3523.6) 


73.5 
(3676.0) 


61.2 
(3795.4) 


6 


Fail: 10/10 


184.5 
(2584.0) 


107.1 

(2785.6) 


75.9 
(2885.2) 


59.6 
(2981.0) 


49.3 
(3057.6) 


8 


Fail: 10/10 


Fail: 1/10 


91.4 
(2377.4) 


66.8 
(2539.4) 


52.2 
(2611.0) 


43.3 
(2685.6) 


10 


Fail: 10/10 


Fail: 1/10 


83.3 
(2166.8) 


60.5 
(2300.0) 


48.5 
(2426.0) 


40.1 
(2487.2) 


12 


Fail: 10/10 


144.2 
(2019.8) 


78.3 
(2036.8) 


57.5 
(2186.0) 


46.4 
(2321.0) 


Fail: 1/10 



4. Conclusions and discussions 

In this paper we present an evidence that the convergence behavior of 
the block BiCGGR can be improved by the preconditioning technique. Our 
numerical tests show that the rank loss problem is remedied by the use of the 
inner solver with the Jacobi method as a preconditioner. As an optimized 
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Figure 1: L dependence of the convergence behaviors for the j — 12 case. All the mea- 
surements are performed on the same configuration. 



choice of L and the inner iteration j we can achieve 90% cost reduction in 
terms of the execution time. 

There remains a couple of future works. Firstly, it is worthwhile to search 
a better preconditioner which assures the convergence for wider range of L 
with less computational cost. Secondly, it is important to investigate why 
the preconditioner allows us to avoid the rank loss problem. Thirdly, we 
plan to apply the preconditioned block BiCGGR method to one of the state- 
of-the-art gauge configurations generated by the PACS-CS Collaboration [ij]. 
Fourthly, it is interesting to make a direct comparison of the algorithmic effi- 
ciency between the preconditioned block BiCGGR method and other multiple 
right-hand-side methods^, @, 0, @]- 
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Fail: 1/10 
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